The telecom industry is highly competitive, and understanding customer behavior is crucial for reducing churn and maximizing customer lifetime value (LTV). This project aims to analyze customer data to perform segmentation based on usage patterns and demographic information. Additionally, the project will calculate and predict the lifetime value of customers, enabling the telecom company to identify high-value customers and develop targeted retention strategies.
The primary objective of the TelecomChurn project is to enhance customer retention and maximize lifetime value (LTV) by leveraging advanced customer segmentation and predictive analytics. This will be achieved through the following key goals:
Customer Segmentation:
Calculate and Predict Customer Lifetime Value (LTV):
Targeted Retention Strategies:
By implementing these strategies, the TelecomChurn project aims to provide actionable insights into customer behavior, enhance customer engagement, and significantly improve the telecom company's competitive edge in the market.
So, join us as we embark on this data-driven journey, exploring the behaviors and trends within the telecom industry. We aim to uncover valuable insights into customer segmentation and predict customer lifetime value, providing a deeper understanding of customer retention and profitability.
Team members:
1. Daniel McDinna
2. Oluwakayode Onakoya
3. Ebunoluwa Adebukola
4. Anuoluwa Ojebode
5. Adekeye AbdulRasheed
# Let's begin!
# Import necessary libraries for our analysis and insight exploration of the dataset
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
from math import pi
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
import sqlite3
import math
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
# Read in the dataset to our notebook
telecom_data = pd.read_csv(r"C:\Users\USER PC\OneDrive\Documents\Business and Education\INGRYD\Project\Capstone Project\Telecom_Churn.csv")
telecom_data
| customer_id | telecom_partner | gender | age | state | city | pincode | date_of_registration | num_dependents | estimated_salary | calls_made | sms_sent | data_used | churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Reliance Jio | F | 25 | Karnataka | Kolkata | 755597 | 2020-01-01 | 4 | 124962 | 44 | 45 | -361 | 0 |
| 1 | 2 | Reliance Jio | F | 55 | Mizoram | Mumbai | 125926 | 2020-01-01 | 2 | 130556 | 62 | 39 | 5973 | 0 |
| 2 | 3 | Vodafone | F | 57 | Arunachal Pradesh | Delhi | 423976 | 2020-01-01 | 0 | 148828 | 49 | 24 | 193 | 1 |
| 3 | 4 | BSNL | M | 46 | Tamil Nadu | Kolkata | 522841 | 2020-01-01 | 1 | 38722 | 80 | 25 | 9377 | 1 |
| 4 | 5 | BSNL | F | 26 | Tripura | Delhi | 740247 | 2020-01-01 | 2 | 55098 | 78 | 15 | 1393 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 243548 | 243549 | Airtel | F | 28 | Mizoram | Kolkata | 110295 | 2023-05-03 | 3 | 130580 | 28 | 9 | 4102 | 0 |
| 243549 | 243550 | Reliance Jio | F | 52 | Assam | Kolkata | 713481 | 2023-05-03 | 0 | 82393 | 80 | 45 | 7521 | 0 |
| 243550 | 243551 | Reliance Jio | M | 59 | Tripura | Kolkata | 520218 | 2023-05-03 | 4 | 51298 | 26 | 4 | 6547 | 0 |
| 243551 | 243552 | BSNL | M | 49 | Madhya Pradesh | Kolkata | 387744 | 2023-05-03 | 2 | 83981 | 80 | 15 | 1125 | 0 |
| 243552 | 243553 | BSNL | F | 37 | Telangana | Hyderabad | 139086 | 2023-05-04 | 0 | 144297 | 61 | 7 | 3384 | 0 |
243553 rows × 14 columns
# We delve into the info of the Dataset
telecom_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243553 entries, 0 to 243552
Data columns (total 14 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   customer_id           243553 non-null  int64
 1   telecom_partner       243553 non-null  object
 2   gender                243553 non-null  object
 3   age                   243553 non-null  int64
 4   state                 243553 non-null  object
 5   city                  243553 non-null  object
 6   pincode               243553 non-null  int64
 7   date_of_registration  243553 non-null  object
 8   num_dependents        243553 non-null  int64
 9   estimated_salary      243553 non-null  int64
 10  calls_made            243553 non-null  int64
 11  sms_sent              243553 non-null  int64
 12  data_used             243553 non-null  int64
 13  churn                 243553 non-null  int64
dtypes: int64(9), object(5)
memory usage: 26.0+ MB
# Get the full description of the Dataset
telecom_data.describe()
| customer_id | age | pincode | num_dependents | estimated_salary | calls_made | sms_sent | data_used | churn | |
|---|---|---|---|---|---|---|---|---|---|
| count | 243553.000000 | 243553.000000 | 243553.000000 | 243553.000000 | 243553.000000 | 243553.000000 | 243553.000000 | 243553.000000 | 243553.000000 |
| mean | 121777.000000 | 46.077609 | 549501.270541 | 1.997500 | 85021.137839 | 49.010548 | 23.945404 | 4993.186025 | 0.200478 |
| std | 70307.839393 | 16.444029 | 259808.860574 | 1.414941 | 37508.963233 | 29.453556 | 14.733575 | 2942.019547 | 0.400359 |
| min | 1.000000 | 18.000000 | 100006.000000 | 0.000000 | 20000.000000 | -10.000000 | -5.000000 | -987.000000 | 0.000000 |
| 25% | 60889.000000 | 32.000000 | 324586.000000 | 1.000000 | 52585.000000 | 24.000000 | 11.000000 | 2490.000000 | 0.000000 |
| 50% | 121777.000000 | 46.000000 | 548112.000000 | 2.000000 | 84990.000000 | 49.000000 | 24.000000 | 4987.000000 | 0.000000 |
| 75% | 182665.000000 | 60.000000 | 774994.000000 | 3.000000 | 117488.000000 | 74.000000 | 36.000000 | 7493.000000 | 0.000000 |
| max | 243553.000000 | 74.000000 | 999987.000000 | 4.000000 | 149999.000000 | 108.000000 | 53.000000 | 10991.000000 | 1.000000 |
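The `describe()` output above reveals impossible negative minimums for `calls_made` (-10), `sms_sent` (-5), and `data_used` (-987). A minimal sketch of one way to handle them, assuming clipping to zero is acceptable (the toy frame below is illustrative; column names follow the dataset):

```python
import pandas as pd

# Toy frame mirroring the columns where describe() reported negative minimums
df = pd.DataFrame({
    "calls_made": [44, -10, 62],
    "sms_sent": [45, 39, -5],
    "data_used": [-361, 5973, 193],
})

# Clip impossible negative usage values to zero; an alternative is to treat
# them as missing (np.nan) and impute later
usage_cols = ["calls_made", "sms_sent", "data_used"]
df[usage_cols] = df[usage_cols].clip(lower=0)
print(df)
```

Whether clipping or imputation is more appropriate depends on whether the negatives are sign errors or sentinel values, which the dataset documentation would need to confirm.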
# Check for null values in the dataset
telecom_data.isna().sum()
customer_id             0
telecom_partner         0
gender                  0
age                     0
state                   0
city                    0
pincode                 0
date_of_registration    0
num_dependents          0
estimated_salary        0
calls_made              0
sms_sent                0
data_used               0
churn                   0
dtype: int64
# Select numerical columns for outlier detection
# (note: the binary 'churn' column has an IQR of 0, so every churned customer is flagged as an outlier here)
numerical_columns = ['customer_id', 'age', 'pincode', 'num_dependents', 'estimated_salary', 'sms_sent', 'data_used', 'churn']
# Function to detect outliers using IQR method
def detect_outliers(df, cols):
outliers = pd.DataFrame()
for col in cols:
q1 = df[col].quantile(0.25)
q3 = df[col].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers_col = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
outliers = pd.concat([outliers, outliers_col], axis=0)
return outliers
# Detect outliers in numerical columns
outliers_df = detect_outliers(telecom_data, numerical_columns)
# Display the outliers
print("Below are the outliers in the Telecom Churn dataset:")
outliers_df
Below are the outliers in the Telecom Churn dataset:
| customer_id | telecom_partner | gender | age | state | city | pincode | date_of_registration | num_dependents | estimated_salary | ... | monthly_calls | monthly_sms | monthly_data | R | F | M | RFM_Segment | RFM_Score | Cluster | LTV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 3 | Vodafone | F | 57 | Arunachal Pradesh | Delhi | 423976 | 2020-01-01 | 0 | 148828 | ... | 0.912477 | 0.446927 | 3.594041 | 4 | 1 | 4 | 414 | 9 | 0 | 1.747473e+08 |
| 3 | 4 | BSNL | M | 46 | Tamil Nadu | Kolkata | 522841 | 2020-01-01 | 1 | 38722 | ... | 1.489758 | 0.465549 | 174.618250 | 4 | 4 | 1 | 441 | 9 | 1 | 1.620697e+09 |
| 7 | 8 | BSNL | M | 46 | Arunachal Pradesh | Kolkata | 866786 | 2020-01-01 | 4 | 104541 | ... | 1.620112 | 0.744879 | 41.806331 | 4 | 1 | 3 | 413 | 8 | 3 | 1.094575e+09 |
| 10 | 11 | Airtel | M | 44 | Uttarakhand | Chennai | 776250 | 2020-01-01 | 0 | 133288 | ... | 0.130354 | 0.130354 | 23.743017 | 4 | 1 | 4 | 414 | 9 | 0 | 7.583823e+08 |
| 16 | 17 | Vodafone | M | 61 | Himachal Pradesh | Hyderabad | 734068 | 2020-01-01 | 2 | 59723 | ... | 0.297952 | 0.782123 | 151.899441 | 4 | 4 | 2 | 442 | 10 | 2 | 2.165676e+09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 243523 | 243524 | Reliance Jio | F | 38 | Gujarat | Delhi | 112512 | 2023-05-03 | 0 | 69163 | ... | 6.564885 | 0.763359 | 559.007634 | 1 | 3 | 2 | 132 | 6 | 2 | 5.526996e+08 |
| 243527 | 243528 | BSNL | F | 50 | Uttar Pradesh | Mumbai | 613558 | 2023-05-03 | 3 | 126994 | ... | 2.977099 | 0.152672 | 243.206107 | 1 | 2 | 4 | 124 | 7 | 0 | 4.414201e+08 |
| 243529 | 243530 | Reliance Jio | M | 45 | Punjab | Kolkata | 190431 | 2023-05-03 | 2 | 110127 | ... | 4.427481 | 1.832061 | 217.709924 | 1 | 2 | 3 | 123 | 6 | 3 | 3.480357e+08 |
| 243533 | 243534 | Reliance Jio | F | 22 | West Bengal | Kolkata | 732518 | 2023-05-03 | 2 | 123399 | ... | 5.267176 | 0.687023 | 34.198473 | 1 | 1 | 4 | 114 | 6 | 0 | 6.991451e+07 |
| 243535 | 243536 | BSNL | M | 32 | Andhra Pradesh | Delhi | 439538 | 2023-05-03 | 1 | 104748 | ... | 4.732824 | 1.145038 | 362.061069 | 1 | 2 | 3 | 123 | 6 | 3 | 5.438294e+08 |
48827 rows × 28 columns
# Outliers count
outliers_df.count()
customer_id             48827
telecom_partner         48827
gender                  48827
age                     48827
state                   48827
city                    48827
pincode                 48827
date_of_registration    48827
num_dependents          48827
estimated_salary        48827
calls_made              48827
sms_sent                48827
data_used               48827
churn                   48827
dtype: int64
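Because `detect_outliers` concatenates the hits column by column, a row that is extreme in several columns appears once per flagged column, inflating the count. A minimal sketch of the same IQR logic plus de-duplication (toy data, not the telecom dataset):

```python
import pandas as pd

# Toy frame: row 4 is extreme in both columns
df = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 1000]})

def detect_outliers(df, cols):
    # Same IQR logic as above: hits are concatenated column by column
    outliers = pd.DataFrame()
    for col in cols:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        outliers = pd.concat([outliers, df[mask]], axis=0)
    return outliers

outliers = detect_outliers(df, ["a", "b"])
print(len(outliers))  # row 4 is counted once per flagged column

# Keep each flagged row only once
unique_outliers = outliers.loc[~outliers.index.duplicated()]
print(len(unique_outliers))
```

Applying the same `drop`/`duplicated` step to `outliers_df` would give the count of distinct customers flagged, rather than the count of column-level hits.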
# Distribution of 'gender'
sns.countplot(x='gender', data=telecom_data)
plt.title('Distribution of Gender')
plt.show()
# Distribution of 'state'
plt.figure(figsize=(12, 6))
sns.countplot(x='state', data=telecom_data, order=telecom_data['state'].value_counts().index)
plt.title('Distribution of State')
plt.xticks(rotation=90)
plt.show()
# Distribution of 'age'
sns.histplot(telecom_data['age'], bins=30, kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Customer Count')
plt.show()
# Distribution of 'calls_made'
sns.histplot(telecom_data['calls_made'], bins=30, kde=True)
plt.title('Distribution of Calls Made')
plt.xlabel('Calls Made')
plt.ylabel('Customer Count')
plt.show()
# Distribution of 'sms_sent'
sns.histplot(telecom_data['sms_sent'], bins=30, kde=True)
plt.title('Distribution of SMS Sent')
plt.xlabel('SMS Sent')
plt.ylabel('Customer Count')
plt.show()
# Distribution of 'data_used'
sns.histplot(telecom_data['data_used'], bins=30, kde=True)
plt.title('Distribution of Data Used')
plt.xlabel('Data Used')
plt.ylabel('Customer Count')
plt.show()
# Convert 'date_of_registration' to datetime
telecom_data['date_of_registration'] = pd.to_datetime(telecom_data['date_of_registration'])
# Assuming the current date is known
current_date = datetime.now()
# Recency: Days since last registration
telecom_data['recency'] = (current_date - telecom_data['date_of_registration']).dt.days
# Frequency: total interactions (calls + SMS + data) as a proxy
# Adjust this if real transaction-level data becomes available
telecom_data['frequency'] = telecom_data['calls_made'] + telecom_data['sms_sent'] + telecom_data['data_used']
# Monetary: estimated salary is used as a proxy for monetary value
telecom_data['monetary'] = telecom_data['estimated_salary']
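One caveat: `datetime.now()` makes recency drift every time the notebook is re-run. A reproducible alternative pins a fixed snapshot date; a sketch, assuming the day after the latest registration is an acceptable snapshot:

```python
import pandas as pd

# Toy registrations; in the notebook this would be telecom_data['date_of_registration']
reg = pd.Series(pd.to_datetime(["2020-01-01", "2023-05-04"]))

# Pin recency to a fixed snapshot date (here, the day after the latest
# registration) so the feature is identical across runs
snapshot = reg.max() + pd.Timedelta(days=1)
recency_days = (snapshot - reg).dt.days
print(recency_days.tolist())
```

Any fixed business date (e.g. the data-extraction date) works equally well; the point is that the snapshot should not move with the clock.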
# Let's break the DataFrame down into more columns for simplicity
telecom_data['tenure_days'] = (pd.Timestamp.now() - telecom_data['date_of_registration']).dt.days
telecom_data['monthly_calls'] = telecom_data['calls_made'] / (telecom_data['tenure_days'] / 30)
telecom_data['monthly_sms'] = telecom_data['sms_sent'] / (telecom_data['tenure_days'] / 30)
telecom_data['monthly_data'] = telecom_data['data_used'] / (telecom_data['tenure_days'] / 30)
#telecom_data['LTV'] = telecom_data['estimated_salary'] * (telecom_data['tenure_days'] / 30)
# Define quantiles for RFM ranking
r_labels = range(1, 5)
f_labels = range(1, 5)
m_labels = range(1, 5)
# Assign quantiles to RFM values
telecom_data['R'] = pd.qcut(telecom_data['recency'], q=4, labels=r_labels).astype(int)
telecom_data['F'] = pd.qcut(telecom_data['frequency'].rank(method='first'), q=4, labels=f_labels).astype(int)
telecom_data['M'] = pd.qcut(telecom_data['monetary'].rank(method='first'), q=4, labels=m_labels).astype(int)
# Create RFM segment and score
telecom_data['RFM_Segment'] = telecom_data['R'].astype(str) + telecom_data['F'].astype(str) + telecom_data['M'].astype(str)
telecom_data['RFM_Score'] = telecom_data[['R', 'F', 'M']].sum(axis=1)
# Display the first few rows to confirm RFM scores
telecom_data[['customer_id', 'RFM_Segment', 'RFM_Score']].head()
| customer_id | RFM_Segment | RFM_Score | |
|---|---|---|---|
| 0 | 1 | 414 | 9 |
| 1 | 2 | 434 | 11 |
| 2 | 3 | 414 | 9 |
| 3 | 4 | 441 | 9 |
| 4 | 5 | 412 | 7 |
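The `RFM_Segment` and `RFM_Score` columns can be rolled up to see which segments dominate. A small sketch on toy data (the tier thresholds below are illustrative assumptions, not taken from the notebook):

```python
import pandas as pd

# Toy RFM values mirroring the head() shown above
rfm = pd.DataFrame({"RFM_Segment": ["414", "434", "414", "441", "412"],
                    "RFM_Score": [9, 11, 9, 9, 7]})

# Count customers per segment, largest first
segment_counts = rfm["RFM_Segment"].value_counts()
print(segment_counts.head())

# A simple tiering by total score (thresholds are illustrative)
rfm["tier"] = pd.cut(rfm["RFM_Score"], bins=[2, 6, 9, 12],
                     labels=["Low", "Mid", "High"])
```

On the full dataset, the same `value_counts()` call on `telecom_data['RFM_Segment']` would surface the largest behavioural segments for targeting.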
# Display the first few rows to confirm new features
telecom_data[['customer_id', 'recency', 'frequency', 'monetary']].head()
| customer_id | recency | frequency | monetary | |
|---|---|---|---|---|
| 0 | 1 | 1611 | -272 | 124962 |
| 1 | 2 | 1611 | 6074 | 130556 |
| 2 | 3 | 1611 | 266 | 148828 |
| 3 | 4 | 1611 | 9482 | 38722 |
| 4 | 5 | 1611 | 1486 | 55098 |
# Display the first few rows to confirm new features
telecom_data[['customer_id', 'R', 'F', 'M']].head()
| customer_id | R | F | M | |
|---|---|---|---|---|
| 0 | 1 | 4 | 1 | 4 |
| 1 | 2 | 4 | 3 | 4 |
| 2 | 3 | 4 | 1 | 4 |
| 3 | 4 | 4 | 4 | 1 |
| 4 | 5 | 4 | 1 | 2 |
# Define a list of colors
colors = ['blue', 'orange', 'green', 'red', 'purple', 'brown', 'pink', 'gray', 'olive', 'cyan',
'yellow', 'black', 'magenta', 'turquoise', 'lime']
# Calculate the number of rows needed
num_columns = len(telecom_data.columns) - 1
num_rows = math.ceil(num_columns / 3)
plt.figure(figsize=(20, 5 * num_rows))
# Loop through columns and plot histograms
i = 1
for col in telecom_data.columns[:-1]:
plt.subplot(num_rows, 3, i)
sns.histplot(telecom_data[col], color=colors[i % len(colors)]) # Use a different color for each plot
plt.title(col)
i += 1
plt.tight_layout()
plt.show()
# Let's create a new DataFrame for data transformation, keeping the original df unaltered
# Then apply standardization (z-score scaling) to the new df
scaler_df = pd.read_csv(r"C:\Users\USER PC\OneDrive\Documents\Business and Education\INGRYD\Project\Capstone Project\Telecom_Churn_Clone.csv")
# Standardization with StandardScaler (zero mean, unit variance)
scaler = StandardScaler()
scaler_df[['age', 'estimated_salary', 'calls_made', 'sms_sent', 'data_used']] = scaler.fit_transform(telecom_data[['age', 'estimated_salary', 'calls_made', 'sms_sent', 'data_used']])
# Display the first few rows to confirm transformations
scaler_df.head()
| customer_id | telecom_partner | gender | age | state | city | pincode | date_of_registration | num_dependents | estimated_salary | calls_made | sms_sent | data_used | churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Reliance Jio | F | -1.281782 | Karnataka | Kolkata | 755597 | 2020-01-01 | 4 | 1.064837 | -0.170117 | 1.429025 | -1.819905 | 0 |
| 1 | 2 | Reliance Jio | F | 0.542593 | Mizoram | Mumbai | 125926 | 2020-01-01 | 2 | 1.213975 | 0.441016 | 1.021791 | 0.333042 | 0 |
| 2 | 3 | Vodafone | F | 0.664218 | Arunachal Pradesh | Delhi | 423976 | 2020-01-01 | 0 | 1.701113 | -0.000358 | 0.003706 | -1.631599 | 1 |
| 3 | 4 | BSNL | M | -0.004720 | Tamil Nadu | Kolkata | 522841 | 2020-01-01 | 1 | -1.234351 | 1.052149 | 0.071578 | 1.490073 | 1 |
| 4 | 5 | BSNL | F | -1.220969 | Tripura | Delhi | 740247 | 2020-01-01 | 2 | -0.797761 | 0.984245 | -0.607145 | -1.223715 | 0 |
# confirm telecom_data is unaltered
telecom_data.head()
| customer_id | telecom_partner | gender | age | state | city | pincode | date_of_registration | num_dependents | estimated_salary | ... | monetary | tenure_days | monthly_calls | monthly_sms | monthly_data | R | F | M | RFM_Segment | RFM_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Reliance Jio | F | 25 | Karnataka | Kolkata | 755597 | 2020-01-01 | 4 | 124962 | ... | 124962 | 1611 | 0.819367 | 0.837989 | -6.722533 | 4 | 1 | 4 | 414 | 9 |
| 1 | 2 | Reliance Jio | F | 55 | Mizoram | Mumbai | 125926 | 2020-01-01 | 2 | 130556 | ... | 130556 | 1611 | 1.154562 | 0.726257 | 111.229050 | 4 | 3 | 4 | 434 | 11 |
| 2 | 3 | Vodafone | F | 57 | Arunachal Pradesh | Delhi | 423976 | 2020-01-01 | 0 | 148828 | ... | 148828 | 1611 | 0.912477 | 0.446927 | 3.594041 | 4 | 1 | 4 | 414 | 9 |
| 3 | 4 | BSNL | M | 46 | Tamil Nadu | Kolkata | 522841 | 2020-01-01 | 1 | 38722 | ... | 38722 | 1611 | 1.489758 | 0.465549 | 174.618250 | 4 | 4 | 1 | 441 | 9 |
| 4 | 5 | BSNL | F | 26 | Tripura | Delhi | 740247 | 2020-01-01 | 2 | 55098 | ... | 55098 | 1611 | 1.452514 | 0.279330 | 25.940410 | 4 | 1 | 2 | 412 | 7 |
5 rows × 26 columns
# Let's amend scaler_df with R F and M values
scaler_df['R'] = telecom_data['R']
scaler_df['F'] = telecom_data['F']
scaler_df['M'] = telecom_data['M']
scaler_df.head()
| customer_id | telecom_partner | gender | age | state | city | pincode | date_of_registration | num_dependents | estimated_salary | calls_made | sms_sent | data_used | churn | R | F | M | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Reliance Jio | F | -1.281782 | Karnataka | Kolkata | 755597 | 2020-01-01 | 4 | 1.064837 | -0.170117 | 1.429025 | -1.819905 | 0 | 4 | 1 | 4 |
| 1 | 2 | Reliance Jio | F | 0.542593 | Mizoram | Mumbai | 125926 | 2020-01-01 | 2 | 1.213975 | 0.441016 | 1.021791 | 0.333042 | 0 | 4 | 3 | 4 |
| 2 | 3 | Vodafone | F | 0.664218 | Arunachal Pradesh | Delhi | 423976 | 2020-01-01 | 0 | 1.701113 | -0.000358 | 0.003706 | -1.631599 | 1 | 4 | 1 | 4 |
| 3 | 4 | BSNL | M | -0.004720 | Tamil Nadu | Kolkata | 522841 | 2020-01-01 | 1 | -1.234351 | 1.052149 | 0.071578 | 1.490073 | 1 | 4 | 4 | 1 |
| 4 | 5 | BSNL | F | -1.220969 | Tripura | Delhi | 740247 | 2020-01-01 | 2 | -0.797761 | 0.984245 | -0.607145 | -1.223715 | 0 | 4 | 1 | 2 |
# Plot histograms for Recency, Frequency, and Monetary
plt.figure(figsize=(15, 5))
# Recency
plt.subplot(1, 3, 1)
plt.hist(telecom_data['recency'], bins=20, color='blue')
plt.title('Distribution of Recency')
plt.xlabel('Days Since Last Registration')
plt.ylabel('Number of Customers')
# Frequency
plt.subplot(1, 3, 2)
plt.hist(telecom_data['frequency'], bins=20, color='red')
plt.title('Distribution of Frequency')
plt.xlabel('Number of Interactions')
plt.ylabel('Number of Customers')
# Monetary
plt.subplot(1, 3, 3)
plt.hist(telecom_data['monetary'], bins=20, color='green')
plt.title('Distribution of Monetary Value')
plt.xlabel('Estimated Salary')
plt.ylabel('Number of Customers')
plt.tight_layout()
plt.show()
# Compute the correlation matrix
corr_matrix = telecom_data[['recency', 'frequency', 'monetary', 'R', 'F', 'M']].corr()
# Generate a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix of RFM Features')
plt.show()
From the heatmap, we deduce that...:
# Box plots for Recency, Frequency, and Monetary
plt.figure(figsize=(15, 5))
# Recency
plt.subplot(1, 3, 1)
sns.boxplot(y=telecom_data['recency'], color='blue')
plt.title('Box Plot of Recency')
# Frequency
plt.subplot(1, 3, 2)
sns.boxplot(y=telecom_data['frequency'], color='red')
plt.title('Box Plot of Frequency')
# Monetary
plt.subplot(1, 3, 3)
sns.boxplot(y=telecom_data['monetary'], color='green')
plt.title('Box Plot of Monetary Value')
plt.tight_layout()
plt.show()
print(f'From the chart, we deduce that...: ')
From the chart, we deduce that...:
# Pair-plots for Recency, Frequency, and Monetary
sns.pairplot(telecom_data[['recency', 'frequency', 'monetary']])
plt.suptitle('Pair Plot of RFM Features', y=1.02)
plt.show()
print(f'From the chart, we deduce that: monetary, frequency and recency are all weakly correlated')
From the chart, we deduce that: monetary, frequency and recency are all weakly correlated
# We aim to get the churn counts along with the respective telecom partners and other attributes
# Select specific columns for churned customers
churn_data = telecom_data[telecom_data['churn'] == 1][['telecom_partner', 'age', 'gender', 'estimated_salary', 'state', 'data_used']]
# Select specific columns for non-churned customers
non_churn_data = telecom_data[telecom_data['churn'] == 0][['telecom_partner', 'age', 'gender', 'estimated_salary', 'state', 'data_used']]
# Display the churn data
churn_data
| telecom_partner | age | gender | estimated_salary | state | data_used | |
|---|---|---|---|---|---|---|
| 2 | Vodafone | 57 | F | 148828 | Arunachal Pradesh | 193 |
| 3 | BSNL | 46 | M | 38722 | Tamil Nadu | 9377 |
| 7 | BSNL | 46 | M | 104541 | Arunachal Pradesh | 2245 |
| 10 | Airtel | 44 | M | 133288 | Uttarakhand | 1275 |
| 16 | Vodafone | 61 | M | 59723 | Himachal Pradesh | 8157 |
| ... | ... | ... | ... | ... | ... | ... |
| 243523 | Reliance Jio | 38 | F | 69163 | Gujarat | 7323 |
| 243527 | BSNL | 50 | F | 126994 | Uttar Pradesh | 3186 |
| 243529 | Reliance Jio | 45 | M | 110127 | Punjab | 2852 |
| 243533 | Reliance Jio | 22 | F | 123399 | West Bengal | 448 |
| 243535 | BSNL | 32 | M | 104748 | Andhra Pradesh | 4743 |
48827 rows × 6 columns
# View a list of telecom partners with churned customers
company_names = list(churn_data['telecom_partner'].value_counts().keys())
company_names
['Airtel', 'Reliance Jio', 'Vodafone', 'BSNL']
# Plot a chart for churn distribution
# Get the value counts for each telecom partner in churn_data
company_churn_counts = churn_data['telecom_partner'].value_counts()
# Get the values and labels for the pie chart
company_churn_values = company_churn_counts.values
company_names = company_churn_counts.index
# Plot a chart for churn distribution
plt.figure(figsize=(10,8))
plt.pie(company_churn_values, labels=company_names, autopct='%.1f%%', labeldistance=1.15, wedgeprops={'linewidth': 3, 'edgecolor': 'white'})
plt.title("Churn Percentage of Telecom Partners")
plt.show()
# Get the value counts for each gender in churn_data
gender_churn_counts = churn_data['gender'].value_counts()
# Get the values and labels for the pie chart
gender_churn_values = gender_churn_counts.values
gender_labels = gender_churn_counts.index
# Plot a chart for churn distribution by gender
plt.figure(figsize=(10, 8))
plt.pie(gender_churn_values, labels=gender_labels, autopct='%.1f%%', labeldistance=1.15, wedgeprops={'linewidth': 3, 'edgecolor': 'white'})
plt.title("Churn Percentage by Gender")
plt.show()
# Plot histograms for age distribution
plt.figure(figsize=(14, 6))
# Churned customers
plt.subplot(1, 2, 1)
sns.histplot(churn_data['age'], kde=True, color='red', bins=30)
plt.title('Age Distribution of Churned Customers')
plt.xlabel('Age')
plt.ylabel('Frequency')
# Non-churned customers
plt.subplot(1, 2, 2)
sns.histplot(non_churn_data['age'], kde=True, color='blue', bins=30)
plt.title('Age Distribution of Non-Churned Customers')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
# Plot histograms for estimated salary distribution
plt.figure(figsize=(14, 6))
# Churned customers
plt.subplot(1, 2, 1)
sns.histplot(churn_data['estimated_salary'], kde=True, color='red', bins=30)
plt.title('Estimated Salary Distribution of Churned Customers')
plt.xlabel('Estimated Salary')
plt.ylabel('Frequency')
# Non-churned customers
plt.subplot(1, 2, 2)
sns.histplot(non_churn_data['estimated_salary'], kde=True, color='blue', bins=30)
plt.title('Estimated Salary Distribution of Non-Churned Customers')
plt.xlabel('Estimated Salary')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
# Get sorted value counts for data used
churn_value_data_used = churn_data['data_used'].value_counts().sort_index()
non_churn_value_data_used = non_churn_data['data_used'].value_counts().sort_index()
# Create a DataFrame for plotting
churn_df = pd.DataFrame({'data_used': churn_value_data_used.index, 'count': churn_value_data_used.values, 'churn_status': 'churn'})
non_churn_df = pd.DataFrame({'data_used': non_churn_value_data_used.index, 'count': non_churn_value_data_used.values, 'churn_status': 'non_churn'})
combined_df = pd.concat([churn_df, non_churn_df])
# Plot the scatter plot
plt.figure(figsize=(10, 5))
sns.scatterplot(data=combined_df, x='data_used', y='count', hue='churn_status')
plt.xlabel('Data Used')
plt.ylabel('Count')
plt.title('Scatter Plot of Data Usage by Churn and Non-Churn Customers')
plt.legend(title='Churn Status')
plt.show()
# Plot a clearer scatter plot
churn_value_data_used_sorted = list(churn_data['data_used'].value_counts().sort_values())
non_churn_value_data_used_sorted = list(non_churn_data['data_used'].value_counts().sort_values())
plt.figure(figsize=(10, 5))
sns.scatterplot(data=churn_value_data_used_sorted, label="churn")
sns.scatterplot(data=non_churn_value_data_used_sorted, label="non churn")
plt.xlabel("Data used")
plt.title("Scatter Plot of Data Usage by Churn and Non-Churn Customers")
plt.show()
# Plot a chart for churn distribution
# Get the value counts for each telecom partner in churn_data
churn_value_state_count = churn_data['state'].value_counts()
# Get the values and labels for the pie chart
churn_value_state_value = churn_value_state_count.values
state_names = churn_value_state_count.index
plt.figure(figsize=(20, 10))
plt.pie(churn_value_state_value,
autopct='%.1f%%',
labels=state_names,
labeldistance=1.15,
wedgeprops={'linewidth': 3, 'edgecolor': 'white'}
)
plt.title("Churn Percentage of States")
plt.show()
# Distribution of churn
plt.figure(figsize=(8, 6))
sns.countplot(data=telecom_data, x='churn')
plt.title('Distribution of Churn')
plt.xlabel('Churn')
plt.ylabel('Count')
plt.show()
# Relationship between churn and estimated_salary
plt.figure(figsize=(10, 6))
sns.boxplot(data=telecom_data, x='churn', y='estimated_salary')
plt.title('Estimated Salary vs Churn')
plt.xlabel('Churn')
plt.ylabel('Estimated Salary')
plt.show()
# Relationship between churn and RFM metrics
plt.figure(figsize=(14, 10))
plt.subplot(3, 1, 1)
sns.boxplot(data=telecom_data, x='churn', y='recency')
plt.title('Recency vs Churn')
plt.xlabel('Churn')
plt.ylabel('Recency')
plt.subplot(3, 1, 2)
sns.boxplot(data=telecom_data, x='churn', y='frequency')
plt.title('Frequency vs Churn')
plt.xlabel('Churn')
plt.ylabel('Frequency')
plt.subplot(3, 1, 3)
sns.boxplot(data=telecom_data, x='churn', y='monetary')
plt.title('Monetary Value vs Churn')
plt.xlabel('Churn')
plt.ylabel('Monetary Value')
plt.tight_layout()
plt.show()
We will use the recency, frequency, and monetary features from the telecom_data DataFrame for clustering. Let us find the optimal number of clusters using the elbow method and then apply K-means clustering to segment our customers.
# Select features for clustering
rfm_features = telecom_data[['recency', 'frequency', 'monetary']]
# Determine the optimal number of clusters using the elbow method
sse = {}
for k in range(1, 11):
kmeans = KMeans(n_clusters=k, random_state=1)
kmeans.fit(rfm_features)
sse[k] = kmeans.inertia_
# Plot the elbow curve
plt.figure(figsize=(10, 6))
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')
plt.title('Elbow Method for Optimal Number of Clusters')
plt.show()
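Besides the elbow curve, `silhouette_score` (already imported above) offers a complementary check: it peaks at the k with the best-separated clusters. A sketch on synthetic blobs rather than the telecom data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy 2-D data: three well-separated blobs, so k=3 should score best
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0.0, 5.0, 10.0)])

# Average silhouette for each candidate k; higher is better (max 1.0)
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(best_k)
```

Running the same loop over `rfm_features` would give an independent sanity check on the k chosen from the elbow plot.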
# Apply K-means with the chosen number of clusters (let's say 4 for this example)
optimal_clusters = 4
kmeans = KMeans(n_clusters=optimal_clusters, random_state=1)
telecom_data['Cluster'] = kmeans.fit_predict(rfm_features)
# Display the first few rows to confirm clustering
print(telecom_data[['customer_id', 'Cluster']].head())
   customer_id  Cluster
0            1        0
1            2        0
2            3        0
3            4        1
4            5        2
# Display the first few rows to confirm clustering
print(telecom_data[['customer_id', 'Cluster']])
        customer_id  Cluster
0                 1        0
1                 2        0
2                 3        0
3                 4        1
4                 5        2
...             ...      ...
243548       243549        0
243549       243550        2
243550       243551        1
243551       243552        2
243552       243553        0

[243553 rows x 2 columns]
# Calculate cluster centers
cluster_centers = kmeans.cluster_centers_
# Create a DataFrame with the cluster centers
cluster_centers_df = pd.DataFrame(cluster_centers, columns=rfm_features.columns)
cluster_centers_df['Cluster'] = np.arange(1, len(cluster_centers) + 1)
# Plot cluster centers
plt.figure(figsize=(10, 6))
sns.heatmap(cluster_centers_df.set_index('Cluster'), annot=True, cmap='coolwarm')
plt.title('Cluster Centers')
plt.show()
# Pair plot of RFM features colored by cluster
# (pairplot creates its own figure, so no plt.figure() call is needed)
sns.pairplot(telecom_data, vars=['recency', 'frequency', 'monetary'], hue='Cluster', palette='viridis')
plt.suptitle('Pair Plot of RFM Features by Cluster', y=1.02)
plt.show()
# Box plots for RFM features by cluster
plt.figure(figsize=(15, 5))
# Recency
plt.subplot(1, 3, 1)
sns.boxplot(x='Cluster', y='recency', data=telecom_data)
plt.title('Recency by Cluster')
# Frequency
plt.subplot(1, 3, 2)
sns.boxplot(x='Cluster', y='frequency', data=telecom_data)
plt.title('Frequency by Cluster')
# Monetary
plt.subplot(1, 3, 3)
sns.boxplot(x='Cluster', y='monetary', data=telecom_data)
plt.title('Monetary Value by Cluster')
plt.tight_layout()
plt.show()
# Distribution of Age by Cluster
plt.figure(figsize=(10, 6))
sns.histplot(data=telecom_data, x='age', hue='Cluster', multiple='stack')
plt.title('Age Distribution by Customer Segment')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
#Average Monthly Calls by Cluster
plt.figure(figsize=(10, 6))
sns.barplot(data=telecom_data, x='Cluster', y='monthly_calls', errorbar=None)
plt.title('Average Monthly Calls by Customer Segment')
plt.xlabel('Cluster')
plt.ylabel('Average Monthly Calls')
plt.show()
# Function to create radar chart
def create_radar_chart(row, title, color):
labels = row.index
num_vars = len(labels)
# Compute angle for each axis
angles = [n / float(num_vars) * 2 * pi for n in range(num_vars)]
angles += angles[:1]
# Create figure
fig, ax = plt.subplots(figsize=(6, 6), subplot_kw=dict(polar=True))
# Draw one axe per variable and add labels
plt.xticks(angles[:-1], labels, color='grey', size=8)
# Draw ylabels
ax.set_rlabel_position(0)
plt.yticks([0.2, 0.4, 0.6, 0.8, 1.0], ["0.2", "0.4", "0.6", "0.8", "1.0"], color="grey", size=7)
plt.ylim(0, 1)
# Plot data
values = row.tolist()
values += values[:1]
ax.plot(angles, values, color=color, linewidth=2, linestyle='solid')
ax.fill(angles, values, color=color, alpha=0.4)
plt.title(title, size=20, color=color, y=1.1)
# Assuming cluster_centers_df is your DataFrame with cluster centers and 'Cluster' column
# Normalize cluster centers to [0, 1] so they fit the radar chart's fixed 0-1 axis
# (StandardScaler would yield z-scores outside [0, 1] and the chart would clip them)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
cluster_centers_normalized = scaler.fit_transform(cluster_centers_df.drop('Cluster', axis=1))
cluster_centers_normalized_df = pd.DataFrame(cluster_centers_normalized, columns=rfm_features.columns)
# Plot radar chart for each cluster
for i in range(len(cluster_centers_df)):
row = cluster_centers_normalized_df.iloc[i]
create_radar_chart(row, f'Cluster {i+1}', color=sns.color_palette('viridis', len(cluster_centers_df))[i])
plt.show()
Let's use the estimated salary of customers to devise a marketing strategy through segmentation.
print(f'The minimum salary is: {telecom_data["estimated_salary"].min()} and the maximum salary is: {telecom_data["estimated_salary"].max()}')
The minimum salary is: 20000 and the maximum salary is: 149999
# By statistics
min_salary = telecom_data['estimated_salary'].min()
max_salary = telecom_data['estimated_salary'].max()
salary_range = max_salary - min_salary
segment_size = salary_range / 3
low_income = min_salary
medium_income = min_salary + segment_size
high_income = min_salary + 2 * segment_size
print(f"Low income earners: {low_income} to {medium_income}")
print(f"Medium income earners: {medium_income} to {high_income}")
print(f"High income earners: {high_income} to {max_salary}")
Low income earners: 20000 to 63333.0
Medium income earners: 63333.0 to 106666.0
High income earners: 106666.0 to 149999
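The same equal-width tertiles can be produced directly with `pd.cut` instead of the manual arithmetic above. A sketch on a handful of illustrative salaries:

```python
import pandas as pd

# Illustrative salaries spanning the observed 20000-149999 range
salaries = pd.Series([20000, 55000, 90000, 130000, 149999])

# Three equal-width bins between min and max, mirroring the printed
# boundaries (~20000 / 63333 / 106666 / 149999)
tiers = pd.cut(salaries, bins=3, labels=["Low", "Medium", "High"])
print(tiers.tolist())
```

`pd.qcut` is the alternative when equal-sized groups (same customer count per tier) are preferred over equal-width salary bands.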
'''"estimated_salary" is the column representing estimated salary.
Let's define the segments as:
High-Income: salary > 80000
Mid-Income:  50000 <= salary <= 80000
Low-Income:  salary < 50000
'''
# Define income segments based on estimated salary
def segment_by_salary(salary):
if salary > 80000:
return 'High-Income'
elif 50000 <= salary <= 80000:
return 'Mid-Income'
else:
return 'Low-Income'
# Apply segmentation function to create a new column 'income_segment'
telecom_data['income_segment'] = telecom_data['estimated_salary'].apply(segment_by_salary)
# Create personalized messaging based on income segments
def create_personalized_message(segment):
if segment == 'High-Income':
return "Unlock premium services tailored for your lifestyle!"
elif segment == 'Mid-Income':
return "Discover value-packed bundles designed for your needs!"
else:
return "Affordable plans with great features await you!"
# Apply personalized messaging function to create a new column 'marketing_message'
telecom_data['marketing_message'] = telecom_data['income_segment'].apply(create_personalized_message)
# Display a sample of segmented data with personalized messages
print(telecom_data[['customer_id', 'estimated_salary', 'income_segment', 'marketing_message']].head())
   customer_id  estimated_salary income_segment  \
0            1            124962    High-Income
1            2            130556    High-Income
2            3            148828    High-Income
3            4             38722     Low-Income
4            5             55098     Mid-Income

                                   marketing_message
0  Unlock premium services tailored for your life...
1  Unlock premium services tailored for your life...
2  Unlock premium services tailored for your life...
3  Affordable plans with great features await you!
4  Discover value-packed bundles designed for you...
# Connect to SQLite database
conn = sqlite3.connect('telecom_data.db')
# dataframe is already loaded into `telecom_data`
telecom_data.to_sql('telecom_data', conn, if_exists='replace', index=False)
# Define the SQL query
query = """
WITH AveragePurchaseValue AS (
SELECT
customer_id,
AVG(estimated_salary) AS avg_purchase_value
FROM telecom_data
GROUP BY customer_id
),
PurchaseFrequency AS (
SELECT
customer_id,
(SUM(calls_made) + SUM(sms_sent)+ SUM(data_used)) AS purchase_frequency
FROM telecom_data
GROUP BY customer_id
),
CustomerLifespan AS (
SELECT
customer_id,
(julianday('now') - julianday(date_of_registration)) AS customer_lifespan
FROM telecom_data
),
LTV_Calculation AS (
SELECT
a.customer_id,
a.avg_purchase_value,
b.purchase_frequency,
c.customer_lifespan,
(a.avg_purchase_value * b.purchase_frequency * c.customer_lifespan / 365.0) AS LTV
FROM
AveragePurchaseValue a
JOIN
PurchaseFrequency b ON a.customer_id = b.customer_id
JOIN
CustomerLifespan c ON a.customer_id = c.customer_id
)
SELECT
customer_id,
LTV
FROM
LTV_Calculation
ORDER BY
customer_id ASC;
"""
# Execute the query and fetch results
ltv_df = pd.read_sql_query(query, conn)
# Display the LTV DataFrame
print(ltv_df.head())
# Close the connection
conn.close()
   customer_id           LTV
0            1 -1.500345e+08
1            2  3.500385e+09
2            3  1.747473e+08
3            4  1.620697e+09
4            5  3.614089e+08
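The SQL formula can be reproduced directly in pandas as a sanity check. The sketch below is a minimal, self-contained version using the column names from the query (`estimated_salary`, `calls_made`, `sms_sent`, `data_used`, `date_of_registration`) on a tiny hypothetical frame; note that the formula goes negative whenever a customer's summed usage is negative, which would explain negative LTV rows like customer 1's.

```python
import pandas as pd

def compute_ltv(df: pd.DataFrame, as_of: str) -> pd.Series:
    """Mirror the SQL: LTV = avg_purchase_value * purchase_frequency * lifespan / 365."""
    lifespan_days = (pd.Timestamp(as_of) - pd.to_datetime(df["date_of_registration"])).dt.days
    purchase_frequency = df["calls_made"] + df["sms_sent"] + df["data_used"]
    return df["estimated_salary"] * purchase_frequency * lifespan_days / 365.0

# Tiny illustrative frame (hypothetical values, not from the dataset)
demo = pd.DataFrame({
    "estimated_salary": [100.0],
    "calls_made": [1],
    "sms_sent": [1],
    "data_used": [0.0],
    "date_of_registration": ["2024-05-31"],
})
print(compute_ltv(demo, as_of="2024-06-01"))  # 100 * 2 * 1 day / 365
```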
telecom_data.head()
| customer_id | telecom_partner | gender | age | state | city | pincode | date_of_registration | num_dependents | estimated_salary | ... | tenure_days | monthly_calls | monthly_sms | monthly_data | R | F | M | RFM_Segment | RFM_Score | Cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Reliance Jio | F | 25 | Karnataka | Kolkata | 755597 | 2020-01-01 | 4 | 124962 | ... | 1611 | 0.819367 | 0.837989 | -6.722533 | 4 | 1 | 4 | 414 | 9 | 0 |
| 1 | 2 | Reliance Jio | F | 55 | Mizoram | Mumbai | 125926 | 2020-01-01 | 2 | 130556 | ... | 1611 | 1.154562 | 0.726257 | 111.229050 | 4 | 3 | 4 | 434 | 11 | 0 |
| 2 | 3 | Vodafone | F | 57 | Arunachal Pradesh | Delhi | 423976 | 2020-01-01 | 0 | 148828 | ... | 1611 | 0.912477 | 0.446927 | 3.594041 | 4 | 1 | 4 | 414 | 9 | 0 |
| 3 | 4 | BSNL | M | 46 | Tamil Nadu | Kolkata | 522841 | 2020-01-01 | 1 | 38722 | ... | 1611 | 1.489758 | 0.465549 | 174.618250 | 4 | 4 | 1 | 441 | 9 | 1 |
| 4 | 5 | BSNL | F | 26 | Tripura | Delhi | 740247 | 2020-01-01 | 2 | 55098 | ... | 1611 | 1.452514 | 0.279330 | 25.940410 | 4 | 1 | 2 | 412 | 7 | 2 |
5 rows × 27 columns
ltv_df
| customer_id | LTV | |
|---|---|---|
| 0 | 1 | -1.500345e+08 |
| 1 | 2 | 3.500385e+09 |
| 2 | 3 | 1.747473e+08 |
| 3 | 4 | 1.620697e+09 |
| 4 | 5 | 3.614089e+08 |
| ... | ... | ... |
| 243548 | 243549 | 5.821595e+08 |
| 243549 | 243550 | 6.785698e+08 |
| 243550 | 243551 | 3.634111e+08 |
| 243551 | 243552 | 1.103598e+08 |
| 243552 | 243553 | 5.351703e+08 |
243553 rows × 2 columns
# Distribution of LTV
plt.figure(figsize=(10, 6))
sns.histplot(data=ltv_df, x='LTV', kde=True)
plt.title('Lifetime Value Distribution')
plt.xlabel('Lifetime Value')
plt.ylabel('Frequency')
plt.show()
# Note: this relies on telecom_data and ltv_df sharing the same row order;
# merging on customer_id would be more robust.
telecom_data['LTV'] = ltv_df['LTV']
telecom_data[['customer_id', 'LTV']].head()
| customer_id | LTV | |
|---|---|---|
| 0 | 1 | -1.500345e+08 |
| 1 | 2 | 3.500385e+09 |
| 2 | 3 | 1.747473e+08 |
| 3 | 4 | 1.620697e+09 |
| 4 | 5 | 3.614089e+08 |
ltv_df[['customer_id','LTV']].head()
| customer_id | LTV | |
|---|---|---|
| 0 | 1 | -1.500345e+08 |
| 1 | 2 | 3.500385e+09 |
| 2 | 3 | 1.747473e+08 |
| 3 | 4 | 1.620697e+09 |
| 4 | 5 | 3.614089e+08 |
# Correlation matrix
#numeric_data = telecom_data.select_dtypes(include=['number'])
numeric_data = telecom_data.select_dtypes(include=['int64', 'float64'])
correlation_matrix = numeric_data.corr()
# Heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='rainbow')
plt.title('Correlation Matrix')
plt.show()
# Imports for preprocessing, modelling, and evaluation
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)
# Identify categorical columns
categorical_cols = telecom_data.select_dtypes(include=['object']).columns
# Identify numerical columns
numerical_cols = telecom_data.select_dtypes(include=['int64', 'float64']).columns.drop('churn')
# Preprocess the data
# Define the preprocessing for numerical features
numerical_transformer = SimpleImputer(strategy='median')
# Define the preprocessing for categorical features
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical features
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_cols),
('cat', categorical_transformer, categorical_cols)
])
# Define the model
model = GradientBoostingClassifier(random_state=42)
# Create and evaluate the pipeline
clf = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', model)
])
# Prepare the features and target variable
X = telecom_data.drop(columns=['churn'])
y = telecom_data['churn']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
clf.fit(X_train, y_train)
# Make predictions on the test set
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
# Print evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("Confusion Matrix:\n", conf_matrix)
Accuracy: 0.7991418776046478
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
Confusion Matrix:
 [[38927     1]
 [ 9783     0]]

The zero precision and recall show that the classifier predicts the non-churn class for essentially every customer; the ~0.80 accuracy merely mirrors the class imbalance in the test set.
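A quick way to confirm that this accuracy carries no signal is to compare against a majority-class baseline. The sketch below uses synthetic data with the same ~80/20 imbalance (the variable names are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic 80/20 class imbalance mirroring the churn split above
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 800 + [1] * 200)

# A classifier that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
print("Majority-class accuracy:", baseline.score(X_demo, y_demo))  # 0.8
```

Any churn model has to beat this baseline on recall for the churn class, not just on accuracy; class weights, resampling, or threshold tuning are the usual next steps.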
# Identify categorical columns
categorical_cols = X_train.select_dtypes(include=['object']).columns
# Preprocess the categorical features
# Note: with the default remainder='drop', this transformer discards all
# numerical columns, so the forest trains on one-hot categoricals only
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])
# Define the RandomForestClassifier model with preprocessor
rf_classifier = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Train the model
rf_classifier.fit(X_train, y_train)
# Make predictions
y_pred = rf_classifier.predict(X_test)
We'll start by choosing a model for LTV prediction. Considering the nature of the problem (predicting a continuous value), regression models such as Linear Regression, Decision Trees, Gradient Boosting, or even Neural Networks can be suitable choices.
Let's choose Gradient Boosting Regressor as it often performs well with complex datasets and can capture nonlinear relationships.
# Prepare the features and target variable
X = telecom_data[['recency', 'frequency', 'monetary']]
y = telecom_data['LTV']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the Gradient Boosting Regressor model
gb_regressor = GradientBoostingRegressor()
# Train the model
gb_regressor.fit(X_train, y_train)
GradientBoostingRegressor()
We'll evaluate the trained model using metrics such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).
# Make predictions on the test set
y_pred = gb_regressor.predict(X_test)
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Calculate RMSE and MAE
rmse = mean_squared_error(y_test, y_pred, squared=False)
mae = mean_absolute_error(y_test, y_pred)
print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Absolute Error (MAE):", mae)
Root Mean Squared Error (RMSE): 64992491.76284695
Mean Absolute Error (MAE): 50102285.34452021
The RMSE and MAE values provide an understanding of the predictive accuracy of our model:
Model Accuracy:
Error Magnitude:
Model Improvement:
Business Implications:
Diagnostic Plots:
Cross-Validation:
Feature Importance Analysis:
Model Tuning and Selection:
By addressing these insights and taking steps to refine the model, we can enhance the reliability of our LTV predictions and make more informed business decisions.
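As one concrete step from the list above, cross-validation gives a more stable error estimate than a single train/test split. A minimal sketch, using synthetic regression data as a stand-in for the RFM features and LTV target:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X, y); in the notebook, pass the RFM features and LTV
X_demo, y_demo = make_regression(n_samples=500, n_features=3, noise=10.0, random_state=42)

# 5-fold CV with RMSE scoring (scikit-learn reports it negated)
scores = cross_val_score(GradientBoostingRegressor(random_state=42),
                         X_demo, y_demo, cv=5,
                         scoring="neg_root_mean_squared_error")
print("Per-fold RMSE:", -scores)
print("Mean CV RMSE:", -scores.mean())
```

A large spread across folds would indicate the single-split RMSE above is unreliable.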
# Plot Actual vs. Predicted Values
'''
Visualizing the actual vs. predicted values helps us understand how well our model is performing.'''
plt.scatter(y_test, y_pred)
plt.xlabel('Actual LTV')
plt.ylabel('Predicted LTV')
plt.title('Actual vs Predicted LTV')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red') # Line y=x
plt.show()
# Plot Residuals
'''Plotting the residuals (differences between actual and predicted values)
to help us identify patterns that might indicate model issues.'''
residuals = y_test - y_pred
plt.hist(residuals, bins=30)
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residuals Distribution')
plt.show()
#Feature Importance
#Understanding which features are most important to our model can provide insights into the model’s decisions.
feature_importances = pd.Series(gb_regressor.feature_importances_, index=X.columns)
feature_importances = feature_importances.sort_values(ascending=False)
plt.bar(feature_importances.index, feature_importances)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.show()
# Let's add the necessary columns to scaler_df for testing.
scaler_df[['recency', 'frequency', 'monetary']] = telecom_data[['recency', 'frequency', 'monetary']]
scaler_df[['recency', 'frequency', 'monetary']]
| recency | frequency | monetary | |
|---|---|---|---|
| 0 | 1611 | -272 | 124962 |
| 1 | 1611 | 6074 | 130556 |
| 2 | 1611 | 266 | 148828 |
| 3 | 1611 | 9482 | 38722 |
| 4 | 1611 | 1486 | 55098 |
| ... | ... | ... | ... |
| 243548 | 393 | 4139 | 130580 |
| 243549 | 393 | 7646 | 82393 |
| 243550 | 393 | 6577 | 51298 |
| 243551 | 393 | 1220 | 83981 |
| 243552 | 392 | 3452 | 144297 |
243553 rows × 3 columns
# Let's test the model on scaler_df
new_data_predictions = gb_regressor.predict(scaler_df[['recency', 'frequency', 'monetary']])
print(new_data_predictions)
[1.06318661e+08 3.44108148e+09 2.98132766e+08 ... 3.58163306e+08 5.01801271e+07 7.05467931e+08]
# Identify categorical and numerical columns
categorical_cols = telecom_data.select_dtypes(include=['object']).columns
numerical_cols = telecom_data.select_dtypes(include=['int64', 'float64']).columns.drop('churn')
# Preprocess the data
numerical_transformer = SimpleImputer(strategy='median')
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_cols),
('cat', categorical_transformer, categorical_cols)
])
# Define the model
gb_classifier = GradientBoostingClassifier(random_state=42)
# Create and evaluate the pipeline
clf = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', gb_classifier)
])
# Prepare the features and target variable
X = telecom_data.drop(columns=['churn'])
y = telecom_data['churn']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
clf.fit(X_train, y_train)
# Make predictions on the test set
y_pred = clf.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
# Print evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("Confusion Matrix:\n", conf_matrix)
Accuracy: 0.7991418776046478
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
Confusion Matrix:
 [[38927     1]
 [ 9783     0]]

The result is unchanged from the earlier run: the classifier still never predicts the churn class.
# Ensure telecom_data is already loaded and preprocessed
# Define categorical and numerical columns
categorical_cols = telecom_data.select_dtypes(include=['object']).columns
numerical_cols = telecom_data.select_dtypes(include=['int64', 'float64']).columns.drop('churn')
# Define transformers
numerical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Define preprocessor
preprocessor = ColumnTransformer(
transformers=[
('num', numerical_transformer, numerical_cols),
('cat', categorical_transformer, categorical_cols)
])
# Define the model
gb_classifier = GradientBoostingClassifier(random_state=42)
# Create pipeline
clf = Pipeline(steps=[
('preprocessor', preprocessor),
('classifier', gb_classifier)
])
# Prepare features and target variable
X = telecom_data.drop(columns=['churn'])
y = telecom_data['churn']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
clf.fit(X_train, y_train)
# Ensure scaler_df contains the necessary columns
if not all(col in scaler_df.columns for col in ['recency', 'frequency', 'monetary']):
    raise ValueError("scaler_df does not contain the necessary columns: 'recency', 'frequency', 'monetary'")
# The fitted pipeline expects every column seen during training, so build a
# full feature frame and overwrite the RFM columns with the scaler_df values
new_data = X.copy()
new_data[['recency', 'frequency', 'monetary']] = scaler_df[['recency', 'frequency', 'monetary']].values
# Make predictions on the new data
new_data_predictions = clf.predict(new_data)
# If you want the probabilities
new_data_probabilities = clf.predict_proba(new_data)[:, 1]
# Print the predictions
print("Predicted Classes:", new_data_predictions)
print("Predicted Probabilities:", new_data_probabilities)
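With churn probabilities and predicted LTV in hand, the two models can feed the targeted retention strategy the project set out to build: prioritize outreach by expected value at risk, i.e. churn probability times LTV. A hypothetical sketch (the numbers below are illustrative; in practice use `new_data_probabilities` and the LTV predictions):

```python
import pandas as pd

# Illustrative scores for three customers (hypothetical values)
scores = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "churn_prob": [0.9, 0.2, 0.6],
    "ltv": [100.0, 5000.0, 2000.0],
})

# Expected value at risk = probability of churning x value lost if they do
scores["value_at_risk"] = scores["churn_prob"] * scores["ltv"]
priority = scores.sort_values("value_at_risk", ascending=False)
print(priority["customer_id"].tolist())  # [3, 2, 1]
```

Note that the most likely churner (customer 1) is ranked last here: the ranking targets retention budget at high-value customers, not just high-risk ones.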